Chapter 1: Gathering Data
Welcome to the online content for the first chapter!
Here, you’ll find additional materials to help you make sense of what you’ve read. I’ll assume that you’ve read each chapter before you come to the website. We’ll revisit some of the details from the chapter, and I’ll show you how the statistical programming language R can be used to do the analyses that I’ve discussed in the chapters.
Usually, to use R, people will download the R software onto their computer, along with a development environment called RStudio, which is a convenient way to program in R. However, to keep things simple for this book, I’ll take advantage of webR, which allows you to use R in any ordinary web browser, without having to install anything.
Throughout the book, I use fictitious data from the island of Statistica. All of this data is available online, and we’ll use some of this data to reproduce the results that I mention in each chapter.
Creating variables in R
Let’s get started!
In Chapter 1, you’ll recall that we began with the heights of 5 people. This data is in a csv file that’s available here. If you wanted to, you could download this file and open it in any spreadsheet software. But you don’t need to bother doing that, because R can read these files for us.
Try clicking the ‘Run Code’ button below. (Sometimes these buttons will say ‘Loading webR…’ and you’ll need to wait until they say ‘Run Code’ before you can press them.)
The numbers on the left-hand side inside the box are just line numbers of the code, for easy reference. The output appears directly underneath the box, and its rows are numbered too, because R numbers the rows of a dataframe when it prints it. In this case, those numbers match the person numbers in our data, but that’s just because we happen to have numbered the five people from 1 to 5.
The first line of code in the box reads in from the web the csv file containing the data and puts it inside an object that I’ve decided to call ‘first’, because this data is from the first 5 people. The <- symbol is the ‘assignment operator’ and it’s just an arrow saying ‘put the contents into’ an object called ‘first’.
The second line of code (just the name first on its own) tells R to reveal what’s inside this new ‘first’ object that we’ve just created.
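As a rough sketch of those two lines, the block below reads csv contents into a dataframe and then prints it. The web address of the real file isn’t reproduced here, so I’ve assumed we can pass the csv contents straight to read.csv through its ‘text’ argument instead. The five heights are made-up stand-ins, not the real Statistica values, and ‘csv_text’ is just a name I’ve chosen for illustration.

```r
# Stand-in for the csv file: these heights are invented for illustration,
# not the real Statistica values.
csv_text <- "person,height
1,172
2,158
3,181
4,165
5,169"

# Read the csv contents and put them into an object called 'first'
first <- read.csv(text = csv_text)

# Typing the object's name on its own reveals what is inside it
first
```

When the data lives in a file on the web, read.csv can be given the web address in quotes in place of the ‘text’ argument.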
Here, ‘first’ is called a dataframe, and this dataframe contains two variables. In R, we use the $ sign to say which variable in the dataframe we want. Try running first$height and then first$person.
The first$height instruction gives us all of the values of the height variable within ‘first’, and the first$person command gives us all of the values of the ‘person’ variable.
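Here is a minimal sketch of the $ sign in action. So that it runs on its own, it builds a small ‘first’ dataframe by hand with data.frame rather than reading the real file, and the heights are made-up values for illustration only.

```r
# A hand-built stand-in for 'first' (heights are invented, not the real data)
first <- data.frame(person = 1:5,
                    height = c(172, 158, 181, 165, 169))

# All of the values of the height variable
first$height

# All of the values of the person variable
first$person
```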
Plotting graphs
We can plot the height variable vertically and the person number horizontally using the plot function, like this:
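A sketch of one way to write this is shown below, again using a hand-built ‘first’ dataframe with made-up heights. I’m assuming the simplest form of the call, in which the first thing we give to plot goes along the horizontal axis and the second goes up the vertical axis.

```r
# A hand-built stand-in for 'first' (heights are invented, not the real data)
first <- data.frame(person = 1:5,
                    height = c(172, 158, 181, 165, 169))

# Person number along the horizontal axis, height up the vertical axis
plot(first$person, first$height)
```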
Manipulating variables
R is very good at processing variables. For example, we could add 10 cm onto everyone’s height or double everyone’s height:
In the first line of output, each person’s height has increased by 10. In the second line of output, each person’s height has doubled.
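These two calculations can be sketched as follows. The heights are made-up stand-ins, so the numbers you get from the real data will differ, but the pattern is the same: R applies the arithmetic to every value in the variable at once.

```r
# A hand-built stand-in for 'first' (heights are invented, not the real data)
first <- data.frame(person = 1:5,
                    height = c(172, 158, 181, 165, 169))

# Add 10 cm onto everyone's height
first$height + 10

# Double everyone's height
first$height * 2
```

Notice that neither line changes what is stored in ‘first’. The results are only displayed, because we haven’t used the <- arrow to put them anywhere.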
Perhaps more usefully, we could convert each person’s height from cm to inches by dividing by the conversion factor, 2.54.
Now, these are the five heights in inches (to an unreasonable number of decimal places!).
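The conversion can be sketched like this, using made-up heights once more. The name ‘inches’ is my own choice for illustration, and the final line is an optional extra that isn’t in the chapter: the round function trims the answer to a more reasonable one decimal place.

```r
# A hand-built stand-in for 'first' (heights are invented, not the real data)
first <- data.frame(person = 1:5,
                    height = c(172, 158, 181, 165, 169))

# Convert from cm to inches by dividing by 2.54
inches <- first$height / 2.54
inches

# Optional: round to one decimal place
round(inches, 1)
```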
We can also use functions like the ones below to find the maximum or minimum values of a variable.
These tell us the largest and smallest values within the height variable - i.e. the tallest and shortest people’s heights.
You can even do things like this, to get all of the heights rearranged in order, from smallest to largest:
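As a small sketch, the functions in question are max, min and sort. The heights below are made-up stand-ins, deliberately not in order, so that you can see sort doing something.

```r
# A hand-built stand-in for 'first' (heights are invented, not the real data)
first <- data.frame(person = 1:5,
                    height = c(172, 158, 181, 165, 169))

# The tallest and shortest people's heights
max(first$height)
min(first$height)

# All of the heights, rearranged from smallest to largest
sort(first$height)
```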
Using more than one dataframe
Let’s read in the second data set that we met in Chapter 1, which contained the heights of the second group of people. In the code box below, I’ve decided to call the dataframe ‘second’.
In the csv file, I carried on numbering the people from 6, so this time the person numbers don’t match the row numbers in the output.
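To see the mismatch for yourself without the real file, here is a sketch that builds a stand-in ‘second’ dataframe by hand. The person numbers run from 6 to 10, as described, but the heights are made up for illustration.

```r
# A hand-built stand-in for 'second' (heights are invented, not the real data)
second <- data.frame(person = 6:10,
                     height = c(175, 162, 168, 159, 183))

# R numbers the printed rows from 1, whatever the person numbers are
second
```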
Use the empty code box below to try out some of the same commands as above on this second dataframe. You can type or paste in whatever you like in this box, and then click ‘Run Code’ to run it. You can press ‘enter’ to get additional lines.
Outliers
Finally, let’s read in the larger datasets that had 25 people in them.
We’ll read in the 25-person dataset that contained an outlier and then the one with the outlier corrected:
We haven’t put the dataframe names afterwards this time, because it would take up 25 lines to list the entire contents of each dataframe, and we don’t need to. (Go back and try adding them if you want to.)
You can’t have spaces in ordinary names in R, so it’s common to use full stops instead of spaces. The name ‘outlier.included’ is still just a single name, and it refers to a single dataset.
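Here is a rough sketch of what these two dataframes might look like, built by hand rather than read from the web. Only the 10th person’s values (16 cm and 160 cm) come from the chapter. The other heights are made up, and ‘outlier.corrected’, ‘heights’, ‘with.outlier’ and ‘corrected’ are names I’ve assumed for illustration. The last two lines are an optional way to check what we’ve got without listing all 25 rows.

```r
# Made-up heights for 25 people (the real values are not reproduced here)
heights <- rep(c(172, 158, 181, 165, 169), times = 5)

# Person 10 was recorded as 16 cm when it should have been 160 cm
with.outlier <- heights
with.outlier[10] <- 16
corrected <- heights
corrected[10] <- 160

outlier.included  <- data.frame(person = 1:25, height = with.outlier)
outlier.corrected <- data.frame(person = 1:25, height = corrected)  # assumed name

# Nothing has been displayed so far, because no dataframe name
# appears on a line by itself. These two lines give a quick check:
nrow(outlier.included)     # how many rows the dataframe has
head(outlier.included, 3)  # just the first 3 rows
```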
Now run the code below. Can you see the outlier?
The 10th person was the outlier.
Now try this. What has changed?
The outlier value has now been corrected to 160 cm, instead of 16 cm. And did you notice how the height scale was different on the second one? R chooses sensible scales for the axes for us, unless we specify what we want them to be.
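The two plots, and the point about scales, can be sketched as follows. As before, only the 16 cm and 160 cm values for the 10th person come from the chapter. The remaining heights and the name ‘outlier.corrected’ are assumptions for illustration. The last plot uses the ‘ylim’ setting to fix the height scale ourselves, with limits of 0 and 200 that I’ve picked arbitrarily.

```r
# Made-up heights for 25 people, with person 10 as the outlier
heights <- rep(c(172, 158, 181, 165, 169), times = 5)
outlier.included  <- data.frame(person = 1:25, height = replace(heights, 10, 16))
outlier.corrected <- data.frame(person = 1:25, height = replace(heights, 10, 160))

# R picks the height scale to fit each set of data
plot(outlier.included$person, outlier.included$height)
plot(outlier.corrected$person, outlier.corrected$height)

# Specifying the height scale ourselves, from 0 to 200
plot(outlier.corrected$person, outlier.corrected$height, ylim = c(0, 200))
```

Giving both plots the same ‘ylim’ would make them easier to compare side by side.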
Any problems?
Please note that the code on these pages is always designed to be run in order, from top to bottom. If you were to start somewhere else on the page, or miss out some of it, you might get an error message saying ‘Error: object not found’. For example, if you tried to plot, say, ‘outlier.included’, before you read in that dataframe from the web, then you’d get that error.
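If you’d like to see what that message looks like without breaking anything, here is a small sketch. The name ‘not.yet.created’ is a hypothetical object that has never been given any contents, and tryCatch is used to catch the error so that we can look at its message instead of the code stopping.

```r
# 'not.yet.created' is a hypothetical name that has never been assigned
message.text <- tryCatch(
  not.yet.created,                          # R looks for this object and fails
  error = function(e) conditionMessage(e)   # keep the error's message instead
)

# The message names the object that R couldn't find
message.text
```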